Analyzing Dependencies of Japanese Subordinate Clauses based on Statistics of Scope Embedding Preference
Abstract
This paper proposes a statistical method for learning the dependency preference of Japanese subordinate clauses, in which the scope embedding preference of subordinate clauses is exploited as a useful information source for disambiguating dependencies between subordinate clauses. Estimated dependencies of subordinate clauses successfully increase the precision of an existing statistical dependency analyzer.

1 Introduction

In the Japanese language, since word order in a sentence is relatively free compared with European languages, dependency analysis has been shown to be practical and effective in both rule-based and stochastic approaches to syntactic analysis. In dependency analysis of a Japanese sentence, among the various sources of ambiguity in a sentence, dependency ambiguities of subordinate clauses are among the most problematic, partly because word order in a sentence is relatively free. In general, dependency ambiguities of subordinate clauses cause scope ambiguities of subordinate clauses, which result in an enormous number of syntactic ambiguities of other types of phrases such as noun phrases (see footnote 1).

Footnote 1: In our preliminary corpus analysis using the stochastic dependency analyzer of Fujio and Matsumoto (1998), about 30% of the 210,000 sentences in the EDR bracketed corpus (EDR, 1995) have dependency ambiguities of subordinate clauses. For these sentences, the precision of chunk (bunsetsu) level dependencies is about 85.3% and that of sentence level dependencies is about 25.4% (for best one) to 35.8% (for best five). For the remaining 70% of the EDR bracketed corpus, the precision of chunk (bunsetsu) level dependencies is about 86.7% and that of sentence level dependencies is about 47.5% (for best one) to 60.2% (for best five). In addition, when assuming that those ambiguities of subordinate clause dependencies are initially resolved in some way, the chunk level precision increases to 90.4%, and the sentence level precision to 40.6% (for best one) to 67.7% (for best five).
This result of our preliminary analysis clearly shows that dependency ambiguities of subordinate clauses are among the most problematic sources of syntactic ambiguity in a Japanese sentence.

In Japanese linguistics, the theory of Minami (1974) regarding the scope embedding preference of subordinate clauses is well known. Minami (1974) classifies Japanese subordinate clauses according to the breadths of their scopes and claims that subordinate clauses which inherently have narrower scopes are embedded within the scopes of subordinate clauses which inherently have broader scopes (details are in section 2). By manually analyzing several raw corpora, Minami (1974) classifies various types of Japanese subordinate clauses into three categories, which are totally ordered by the embedding relation of their scopes. In the Japanese computational linguistics community, Shirai et al. (1995) employed Minami (1974)'s theory on the scope embedding preference of Japanese subordinate clauses and applied it to rule-based Japanese dependency analysis. However, in their approach, since the categories of subordinate clauses are obtained by manually analyzing a small number of sentences, their coverage against a large corpus such as the EDR bracketed corpus (EDR, 1995) is quite low (see footnote 2). In order to realize broad-coverage, high-performance dependency analysis of Japanese sentences that exploits the scope embedding preference of subordinate clauses, we propose a corpus-based and statistical alternative to the rule-based manual approach (section 3) (see footnote 3).

Footnote 2: In our implementation, the coverage of the categories of Shirai et al. (1995) is only 30% for all the subordinate clauses included in the whole EDR corpus.

Footnote 3: Previous works on statistical dependency analysis include Fujio and Matsumoto (1998) and Haruno et al. (1998) in Japanese analysis as well as Lafferty et al. (1992), Eisner (1996), and Collins (1996) in English analysis.
In later sections, we discuss the advantages of our approach over several closely related previous works.

Table 1: Word Segmentation, POS Tagging, and Bunsetsu Segmentation of a Japanese Sentence

Word segmentation | POS (+ conjugation form) tagging | Bunsetsu segmentation (chunking) | English translation
Tenki | noun | Tenki-ga | weather
ga | case-particle | Tenki-ga | subject
yoi | adjective (base) | yoi-kara | fine
kara | predicate-conjunctive-particle | yoi-kara | because
dekakeyou | verb (volitional) | dekakeyou | let's go out

Translation of the whole sentence: (Because the weather is fine, let's go out.)

First, we formalize the problem of deciding scope embedding preference as a classification problem, in which various types of linguistic information about each subordinate clause are encoded as features and used for deciding which of two given subordinate clauses has the broader scope. As in the case of Shirai et al. (1995), we formalize the problem of deciding the dependency preference of subordinate clauses by utilizing the correlation between the scope embedding preference and the dependency preference of Japanese subordinate clauses. Then, as a statistical learning method, we employ the decision list learning method of Yarowsky (1994), in which an optimal combination of those features is selected and sorted in the form of decision rules, according to the strength of the correlation between those features and the dependency preference of the two subordinate clauses. We evaluate the proposed method through an experiment on learning the dependency preference of Japanese subordinate clauses from the EDR bracketed corpus (section 4). We show that the proposed method outperforms other related methods and models. We also evaluate the estimated dependencies of subordinate clauses in Fujio and Matsumoto (1998)'s framework of statistical dependency analysis of a whole sentence, in which we successfully increase the precision of both chunk level and sentence level dependencies thanks to the estimated dependencies of subordinate clauses.
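The decision list idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the feature strings (such as "v1:kara"), the two class labels, the add-alpha smoothing constant, and the function names are all hypothetical choices made for illustration. Each feature becomes one rule, scored by the absolute log ratio of its smoothed class counts; rules are sorted by that score, and a new pair of clauses is classified by the first rule whose feature it contains.

```python
import math
from collections import Counter, defaultdict


def learn_decision_list(examples, alpha=0.5):
    """Learn a two-class decision list.

    examples: list of (set of feature strings, label) pairs.
    alpha: smoothing constant added to every count (an assumption
           of this sketch, used to avoid division by zero).
    Returns (rules, default_label); rules are (score, feature, label)
    tuples sorted from strongest to weakest evidence.
    """
    labels = sorted({label for _, label in examples})
    if len(labels) != 2:
        raise ValueError("this sketch handles exactly two classes")
    first, second = labels

    # Count how often each feature co-occurs with each label.
    counts = defaultdict(Counter)
    for features, label in examples:
        for feature in features:
            counts[feature][label] += 1

    rules = []
    for feature, c in counts.items():
        a = c[first] + alpha
        b = c[second] + alpha
        if a == b:
            continue  # the feature carries no preference; no rule
        score = abs(math.log(a / b))
        rules.append((score, feature, first if a > b else second))

    # Strongest evidence first; feature name breaks exact ties.
    rules.sort(key=lambda rule: (-rule[0], rule[1]))

    # Fall back to the majority label when no rule matches.
    default_label = Counter(l for _, l in examples).most_common(1)[0][0]
    return rules, default_label


def classify(rules, default_label, features):
    """Return the label of the first rule whose feature is present."""
    for _, feature, label in rules:
        if feature in features:
            return label
    return default_label


# Hypothetical toy data: "v1"/"v2" mark features of the first and
# second subordinate clause of a pair.
toy_examples = [
    ({"v1:kara", "v2:te"}, "first_broader"),
    ({"v1:kara", "v2:nagara"}, "first_broader"),
    ({"v1:nagara", "v2:kara"}, "second_broader"),
    ({"v1:te", "v2:kara"}, "second_broader"),
    ({"v1:nagara", "v2:te"}, "second_broader"),
]
toy_rules, toy_default = learn_decision_list(toy_examples)
```

Calling `classify(toy_rules, toy_default, {"v1:kara"})` applies the single matching rule for that feature. Because only the first matching rule fires, a decision list uses the single strongest piece of evidence for each decision instead of combining all features.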
2 Analyzing Dependencies between Japanese Subordinate Clauses based on Scope Embedding Preference

2.1 Dependency Analysis of a Japanese Sentence

First, we overview dependency analysis of a Japanese sentence. Since words in a Japanese sentence are not segmented by explicit delimiters, input sentences are first word segmented, part-of-speech tagged, and then chunked into a sequence of segments called bunsetsus (see footnote 4). Each chunk (bunsetsu) generally consists of a set of content words and function words. Then, dependency relations among those chunks are estimated, where most practical dependency analyzers for the Japanese language usually assume the following two constraints:

1. Every chunk (bunsetsu) except the last one modifies exactly one posterior chunk (bunsetsu).
2. No modification crosses another modification in a sentence.

Table 1 gives an example of word segmentation, part-of-speech tagging, and bunsetsu segmentation (chunking) of a Japanese sentence, where the verb and the adjective are tagged with their parts of speech as well as their conjugation forms. Figure 1 shows the phrase structure, the bracketing (see footnote 5), and the dependency (modification) relation of the chunks (bunsetsus) within the sentence.

[Figure 1: An Example of Japanese Subordinate Clause (taken from the Sentence of Table 1). Labels in the figure: Phrase Structure; Scope of Subordinate Clause; Dependency (modification) Relation. Bracketing: (((Tenki-ga) (yoi-kara)) (dekakeyou)).]

Footnote 4: Word segmentation and part-of-speech tagging are performed by the Japanese morphological analyzer Chasen (Matsumoto et al., 1997), and chunking is done by the preprocessor used in Fujio and Matsumoto (1998).

Footnote 5: The phrase structure and the bracketing are shown only for explanation; throughout this paper, the analysis considers only dependency relations.
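The two constraints on dependency structures stated in section 2.1 can be checked mechanically. The following is a minimal sketch under an assumed representation that does not come from the paper: a sentence is a list `heads`, where `heads[i]` is the index of the chunk that chunk `i` modifies and the last entry is `None`. The function name is hypothetical.

```python
def satisfies_constraints(heads):
    """Check the two dependency constraints for one sentence.

    heads[i] is the index of the chunk modified by chunk i;
    the last chunk modifies nothing, so heads[-1] must be None.
    """
    n = len(heads)
    if n == 0 or heads[-1] is not None:
        return False

    # Constraint 1: every chunk except the last one modifies
    # exactly one chunk, and that chunk lies to its right.
    for i in range(n - 1):
        head = heads[i]
        if head is None or not (i < head < n):
            return False

    # Constraint 2: no two modifications cross. With all heads to
    # the right, arcs (i, heads[i]) and (j, heads[j]) with i < j
    # cross exactly when j < heads[i] < heads[j].
    for i in range(n - 1):
        for j in range(i + 1, n - 1):
            if j < heads[i] < heads[j]:
                return False
    return True


# The sentence of Table 1: Tenki-ga -> yoi-kara -> dekakeyou.
table1_heads = [1, 2, None]
```

For the sentence of Table 1, `satisfies_constraints(table1_heads)` holds, since each chunk modifies the next one. A four-chunk structure such as `[2, 3, 3, None]` is rejected because the modification from chunk 0 to chunk 2 crosses the one from chunk 1 to chunk 3.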
A Japanese subordinate clause is a clause whose head chunk satisfies the following properties.
Similar resources
Learning Preference of Dependency between Japanese Subordinate Clauses and its Evaluation in Parsing
(Utsuro et al., 2000) proposed a statistical method for learning the dependency preference of Japanese subordinate clauses, in which the scope embedding preference of subordinate clauses is exploited as a useful information source for disambiguating dependencies between subordinate clauses. Following (Utsuro et al., 2000), this paper presents detailed results of evaluating the proposed method by comparing...
From Performance Principles of Word Order, edited
I present evidence in this paper for a universal preference for clause-initial adverbial subordinators (subordinate conjunctions marking subordinate clauses) over clause-final subordinators. The evidence cited is based on a database containing word order characteristics for a crosslinguistic sample of 625 languages (cf. Dryer 1989b, 1991, 1992). This preference is somewhat similar to a preferen...
The Effect of Present Activity Verbs on Processing Structural Ambiguity in Japanese Garden-Path Sentences
This paper addresses the semantics of the present form (known as the -ru form) of activity verbs in Japanese and examines the effect of these verbs in contrast to that of the inflected form (the -ta form). Garden-path sentences involving an ambiguity between a simple sentential reading and a relative clause reading generally show a preference for the former reading; when the preferred reading p...
Study on the English Corresponding Unit of Chinese Clause
This paper annotates the English corresponding units of Chinese clauses in Chinese-English translation and statistically analyzes them. Firstly, based on Chinese clause segmentation, we segment English target text into corresponding units (clause) to get a Chinese-to-English clause-aligned parallel corpus. Then, we annotate the grammatical properties of the English corresponding clauses in the ...
Detection of perturbed quantization (PQ) steganography based on empirical matrix
The Perturbed Quantization (PQ) steganography scheme is almost undetectable with current steganalysis methods. We present a new steganalysis method for detection of this data hiding algorithm. We show that the PQ method distorts the dependencies of DCT coefficient values; especially changes much lower than significant bit planes. For steganalysis of PQ, we propose features extraction from the e...
Publication date: 2000